18 research outputs found

An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines

Author: A. Mahéo
E. Ayguadé
F. Broquedis
L. Huang
M. Frigo
M. Ihmsen
S. Che
S. Subramaniam
S.L. Olivier
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Nowadays, shared-memory HPC platforms expose a large number of cores organized in a hierarchical way. Parallel application programmers struggle to express more and more fine-grain parallelism and to ensure locality on such NUMA platforms. Independent loops stand as a natural source of parallelism. Parallel environments like OpenMP provide ways of parallelizing them efficiently, but the achieved performance is closely related to the choice of parameters like the granularity of work or the loop scheduler. Considering that both can depend on the target computer, the input data and the loop workload, the application programmer most of the time fails at designing both portable and efficient implementations. We propose in this paper a new OpenMP loop scheduler, called adaptive, that dynamically adapts the granularity of work considering the underlying system state. Our scheduler is able to perform dynamic load balancing while taking memory affinity into account on NUMA architectures. Results show that adaptive outperforms state-of-the-art OpenMP loop schedulers on memory-bound irregular applications, while obtaining performance comparable to static on parallel loops with a regular workload.

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

Author: D. Buntinas
F. Broquedis
G. Mercier
S. Pellegrini
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Multicore processors have not only reintroduced Non-Uniform Memory Access (NUMA) architectures in today's parallel computers, but they are also responsible for non-uniform access times with respect to Input/Output devices (NUIOA). In clusters of multicore machines equipped with several network interfaces, the performance of communication between processes thus depends on which cores these processes are scheduled on, and on their distance to the Network Interface Cards involved. We propose a technique allowing multirail communication between processes to carefully distribute data among the network interfaces so as to counterbalance NUIOA effects. We demonstrate the relevance of our approach by evaluating its implementation within OpenMPI on a Myri-10G + InfiniBand cluster.

Crossref

INRIA a CCSD electronic archive server

Parallel computation of echelon forms

Author: A. Buttari
C.-P. Jeannerod
F. Broquedis
F.G. Gustavson
J. Kurzak
J.-C. Faugère
J.-G. Dumas
J.J. Dongarra
J.V. Gathen
S. Toledo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

We propose efficient parallel algorithms and implementations on shared memory architectures of LU factorization over a finite field. Compared to the corresponding numerical routines, we have identified three main difficulties specific to linear algebra over finite fields. First, the arithmetic complexity could be dominated by modular reductions. Therefore, it is mandatory to delay these reductions as much as possible while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using the Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. There, trade-offs are to be made between block sizes well suited to those fast variants and block sizes well suited to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, the block size then has to be dynamic. We propose and compare several block decompositions: tile iterative with left-looking, right-looking and Crout variants, slab and tile recursive. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occurs. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possible to maintain a high efficiency.

HAL-ENS-LYON

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

NUMA Optimizations for Algorithmic Skeletons

Author: F Broquedis
F Gaud
H González-Vélez
M Cole
P-H Lin
SAM Talbot
SL Olivier
W Bolosky
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Crossref

Edinburgh Research Explorer

Task Scheduling on Manycore Processors with Home Caches

Author: A. Duran
F. Broquedis
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Crossref

Evaluation of OpenMP Task Scheduling Algorithms for Large NUMA Architectures

Author: A. Mahéo
C. Augonnet
C. Liao
C. Terboven
F. Broquedis
M. Pérache
S.N. Agathos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Crossref

Locality Aware Task Scheduling in Parallel Data Stream Processing

Author: A. Duran
A. Kukanov
A.A. Safaei
B. Babcock
D. Bednárek
F. Broquedis
Q. Chen
Q. Jiang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Crossref

Efficient dynamic pinning of parallelized applications by reinforcement learning with applications

Author: E Bini
F Angelis De
F Broquedis
G Chasparis
GC Chasparis
K Narendra
M Dorigo
S Thibault
T Klug
Publication venue: Springer
Publication date: 01/01/2017

This paper describes a dynamic framework for mapping the threads of parallel applications to the computation cores of parallel systems. We propose a feedback-based mechanism where the performance of each thread is collected and used to drive the reinforcement-learning policy of assigning affinities of threads to CPU cores. The proposed framework is flexible enough to address different optimization criteria, such as maximum processing speed and minimum speed variance among threads. We evaluate the framework on the Ant Colony optimization parallel benchmark from the heuristic optimization application domain, and demonstrate that we can achieve an improvement of 12% in the execution time compared to the default operating system scheduling/mapping of threads under varying availability of resources (e.g. when multiple applications are running on the same system).

Crossref

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

University of Dundee Online Publications

University of St. Andrews - Pure

Fine-Grained MPI+OpenMP Plasma Simulations: Communication Overlap with Dependent Tasks

Author: C Augonnet
E Sonnendrücker
F Broquedis
J Bueno
J Diaz
M Pérache
N Bouzat
N Crouseilles
T Gautier
X Besseron
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/08/2019

HAL-ENS-LYON

Crossref

INRIA a CCSD electronic archive server

Crystal and molecular structure of 3-(2-(1-hydroxycyclohexyl)-2-(4-methoxyphenyl)ethyl)-2-(4-p-methylphenyl)-1,3-thiazolidin-4-one

Author: A. Duran
A. Muddukrishna
A. Pop
C. Augonnet
F. Broquedis
H. Nakano
H. Topcuoglu
H. Vandierendonck
J. Planas
M. Frigo
R. Clint Whaley
Publication venue: Taylor & Francis Ltd
Publication date: 01/01/2009

The compound 3-(2-(1-hydroxycyclohexyl)-2-(4-methoxyphenyl)ethyl)-2-(4-methylphenyl)-thiazolidin-4-one was synthesized. The compound crystallizes in the orthorhombic system with space group Pbca; the cell parameters are a=11.795(1) Å, b=11.727(1) Å, c=33.964(9) Å, Z=8, V=4697.9(2) Å³. The final residual factor is R1=0.0647. The molecule exhibits an intermolecular hydrogen bond of type O-H···O.

University of Mysore - Digital Repository of Research, Innovation and Scholarship (ePrints@UoM)

Publikationer från KTH

Crossref

Digitala Vetenskapliga Arkivet - Academic Archive On-line